Learning Expected Emphatic Traces for Deep RL
Authors
Abstract
Off-policy sampling and experience replay are key for improving sample efficiency and scaling model-free temporal difference learning methods. When combined with function approximation, such as neural networks, this combination is known as the deadly triad and is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic weightings and multi-step updates. This approach, however, is generally limited to sampling complete trajectories in order, to compute the required emphatic weighting. In this paper we investigate how to combine emphatic weightings with non-sequential, off-line data sampled from a replay buffer. We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed n-step TD learning algorithm to learn the required emphatic weightings. We show that these state weightings reduce variance compared with prior approaches, while providing convergence guarantees. We tested the approach at scale on Atari 2600 video games, and observed that the new X-ETD(n) agent improved over baseline agents, highlighting both the scalability and broad applicability of our approach.
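To make the emphatic weighting mentioned above concrete, here is a minimal sketch of the classical one-step emphatic TD update with linear function approximation, on which such methods build. The follow-on trace `F` is the emphatic weighting computed along a trajectory; the function name, arguments, and hyperparameter values are illustrative assumptions, and the paper's expected-trace and X-ETD(n) specifics are not reproduced here.

```python
import numpy as np

def emphatic_td0_step(theta, F_prev, rho_prev, phi, phi_next, reward, rho,
                      gamma=0.99, alpha=1e-3, interest=1.0):
    """One emphatic TD(0) update with linear values v(s) = theta @ phi(s).

    F is the follow-on trace, F_t = interest + gamma * rho_{t-1} * F_{t-1},
    which re-weights updates toward states the target policy would visit.
    rho is the importance-sampling ratio pi(a|s) / mu(a|s).
    """
    F = interest + gamma * rho_prev * F_prev                  # emphatic weighting
    delta = reward + gamma * theta @ phi_next - theta @ phi   # TD error
    theta = theta + alpha * F * rho * delta * phi             # emphasized update
    return theta, F
```

Note that `F` depends recursively on the whole preceding trajectory, which is exactly why combining it with non-sequential replay, as the paper proposes, is non-trivial.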
Similar resources
Deep reinforcement learning case study with standard RL testing domains
Reinforcement Learning (RL) is a type of machine learning that enables an agent to determine ideal behavior from its own experience. From robot control to autonomous navigation, reinforcement learning algorithms have been applied to address increasingly difficult problems. In recent studies, a number of papers have shown great success of RL in the field of production control, fi...
Emphatic Temporal-Difference Learning
Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood and White (2015), and Yu (2015) show that by varying the emphasis in a particular way, these algorithms become stable and convergent under off-policy training with linea...
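The varying emphasis described in this abstract can be sketched as one step of ETD(λ), following the update form proposed by Sutton, Mahmood and White (2015); the function name, argument layout, and default hyperparameters here are illustrative assumptions.

```python
import numpy as np

def etd_lambda_step(theta, e, F, rho_prev, rho, phi, phi_next, reward,
                    gamma=0.99, lam=0.9, alpha=1e-3, interest=1.0):
    """One ETD(lambda) update with linear values v(s) = theta @ phi(s)."""
    F = gamma * rho_prev * F + interest       # follow-on trace
    M = lam * interest + (1.0 - lam) * F      # emphasis: blends interest with follow-on
    e = rho * (gamma * lam * e + M * phi)     # emphatic eligibility trace
    delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
    theta = theta + alpha * delta * e
    return theta, e, F
```

The emphasis `M` selectively scales updates on different time steps, which is the mechanism the convergence results for emphatic TD rely on.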
Discrete Sequential Prediction of Continuous Actions for Deep RL
It has long been assumed that high dimensional continuous control problems cannot be solved effectively by discretizing individual dimensions of the action space, due to the exponentially large number of bins over which policies would have to be learned. In this paper, we draw inspiration from the recent success of sequence-to-sequence models for structured prediction problems to develop policies...
Symmetry Detection and Exploitation for Function Approximation in Deep RL
With recent advances in the use of deep networks for complex reinforcement learning (RL) tasks which require large amounts of training data, ensuring sample efficiency has become an important problem. In this work we introduce a novel method to detect environment symmetries using reward trails observed during episodic experience. Next we provide a framework to incorporate the discovered symmetr...
On Convergence of Emphatic Temporal-Difference Learning
We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted Markov decision processes with finite spaces. Such algorithms were recently proposed by Sutton, Mahmood, and White (2015) as an improved solution to the problem of divergence of off-policy temporal-difference learning with linear function approximation. We present in this paper the first convergence...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2022
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v36i6.20660